Skip to content

feat: add StringUtils, DateTimeUtils, PathUtil, OptionsUtils, RapidJsonUtil, and Jsonizable utilities#23

Open
dalingmeng wants to merge 1 commit into
apache:mainfrom
dalingmeng:feat/add-string-json-utils
Open

feat: add StringUtils, DateTimeUtils, PathUtil, OptionsUtils, RapidJsonUtil, and Jsonizable utilities#23
dalingmeng wants to merge 1 commit into
apache:mainfrom
dalingmeng:feat/add-string-json-utils

Conversation

@dalingmeng
Copy link
Copy Markdown
Contributor

Purpose

No Linked issue.

Introduce string/JSON/path utility modules to the apache/paimon-cpp codebase. These modules provide common string manipulation, date-time parsing, path operations, configuration option parsing, and JSON serialization helpers.

Introduce DateTimeUtils for timestamp parsing and formatting (date_time_utils.h)
Introduce TimezoneGuard test utility for timezone-dependent tests (timezone_guard.h)
Introduce StringUtils for string splitting, trimming, type conversion, and formatting (string_utils.h, string_utils.cpp)
Introduce RapidJsonUtil for JSON document read/write helpers (rapidjson_util.h)
Introduce PathUtil for file path manipulation and generation (path_util.h, path_util.cpp)
Introduce OptionsUtils for parsing configuration key-value options (options_utils.h)
Introduce Jsonizable CRTP base for JSON serialization/deserialization (jsonizable.h)

Tests

date_time_utils_test.cpp — Timestamp parsing, formatting, timezone handling
string_utils_test.cpp — String split, trim, conversion, and format operations
rapidjson_util_test.cpp — JSON document creation, read, and write
path_util_test.cpp — Path join, normalize, and UUID-based temp path generation
options_utils_test.cpp — Configuration option key-value parsing
jsonizable_test.cpp — JSON serialization and deserialization round-trip

API and Format

No API or format changes.

Documentation

No documentation changes

Generative AI tooling

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the utility modules. I found a blocking issue in PathUtil: NormalizePath currently turns valid URI paths without an authority, such as file:///tmp or hdfs:///warehouse, into file:/tmp / hdfs:/warehouse. In URI syntax the triple slash is significant here because it means an empty authority plus an absolute path; collapsing it changes the URI shape and can break filesystem/catalog path handling. Please preserve the scheme:///path form when the input has an explicit empty authority, and add tests for file:///.../hdfs:///... normalization.

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction for my previous NormalizePath comment: I rechecked Apache Paimon Java Path.toString(). For new Path("file:///tmp") / new Path("hdfs:///warehouse"), the parsed authority is null and toString() emits file:/tmp / hdfs:/warehouse, not the triple-slash form. Since this C++ implementation returns Path::ToString() and matches that Java behavior, the NormalizePath URI-shape issue I raised should be ignored. No change is needed for that part.

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I re-reviewed PR #23 after correcting my previous NormalizePath concern. The path normalization behavior is compatible with Java Paimon and does not need to change. I found two other issues that should be fixed before merging: date/timestamp parsing currently accepts impossible dates by letting libc normalize them, and JSON deserialization of narrow integer types can silently truncate out-of-range values.

if (ss.fail()) {
return Status::Invalid(fmt::format("failed to convert string '{}' to date", str));
}
std::time_t time = timegm(&timeinfo);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This accepts invalid calendar dates because std::get_time only fills tm, and timegm/mktime normalizes out-of-range fields instead of rejecting them. For example, StringToDate("2019-02-29") will become 2019-03-01, and StringToTimestampMillis("2023-02-30 00:00:00") will become 2023-03-02 00:00:00. Java Paimon's DateTimeUtils.parseDate explicitly validates month/day ranges and rejects invalid dates, so this can persist or compare wrong metadata values. Please validate the parsed year/month/day before converting, or compare the normalized tm back to the parsed fields, and add tests for non-leap Feb 29 / Feb 30.

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more inline issue for the JSON integer deserialization path.

if (!value.IsInt()) {
throw std::invalid_argument("value must be int");
}
return static_cast<int8_t>(value.GetInt());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The small-integer JSON readers only check RapidJSON's broad IsInt() / IsUint() predicates, then cast to int8_t / uint8_t / int16_t / uint16_t. This silently wraps out-of-range JSON values, e.g. deserializing 500 as int8_t yields a truncated value instead of failing. Since StringUtils::StringToValue already rejects the same overflow cases, JSON deserialization should be consistent: check value.GetInt() / GetUint() against std::numeric_limits<T>::min()/max() before casting, and add overflow tests for the narrow integer types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants